Abstract

Latent Dirichlet allocation (LDA) is a Bayesian network that has recently gained
much popularity in applications ranging from document modeling to computer
vision. Due to the large-scale nature of these applications, current inference
procedures like variational Bayes and Gibbs sampling have been found lacking. In
this paper we propose the collapsed variational Bayesian inference algorithm for
LDA, and show that it is computationally efficient, easy to implement and
significantly more accurate than standard variational Bayesian inference for LDA.


1  Introduction 

Bayesian networks with discrete random variables form a very general and useful class of
probabilistic models. In a Bayesian setting it is convenient to endow these models with Dirichlet
priors over the parameters as they are conjugate to the multinomial distributions over the discrete
random variables [1]. This choice has important computational advantages and allows for easy
inference in such models.

A  class  of  Bayesian  networks  that  has  gained  significant  momentum  recently  is  latent  Dirichlet 
allocation  (LDA)  [2],  otherwise  known  as  multinomial  PCA  [3].  It  has  found  important  applications 
in  both  text  modeling  [4,  5]  and  computer  vision  [6].  Training  LDA  on  a  large  corpus  of  several 
million  documents  can  be  a  challenge  and  crucially  depends  on  an  efficient  and  accurate  inference 
procedure.  A  host  of  inference  algorithms  have  been  proposed,  ranging  from  variational  Bayesian 
(VB)  inference  [2],  expectation  propagation  (EP)  [7]  to  collapsed  Gibbs  sampling  [5]. 

Perhaps surprisingly, the collapsed Gibbs sampler proposed in [5] seems to be the preferred choice
in many of these large-scale applications. In [8] it is observed that EP is not efficient enough to
be practical while VB suffers from a large bias. However, collapsed Gibbs sampling also has its
own problems: one needs to assess convergence of the Markov chain, to have some idea of
mixing times in order to estimate the number of samples to collect, and to identify coherent topics
across multiple samples. In practice one often ignores these issues and collects as many samples as
is computationally feasible, while the question of topic identification is often sidestepped by using
a single sample. Hence there still seems to be a need for more efficient, accurate and deterministic
inference procedures.

In this paper we will leverage the important insight that a Gibbs sampler that operates in a
collapsed space (where the parameters are marginalized out) mixes much better than a Gibbs sampler
that samples parameters and latent topic variables simultaneously. This suggests that the
parameters and latent variables are intimately coupled. As we shall see in the following,
marginalizing out the parameters induces new dependencies between the latent variables (which are
conditionally independent given the parameters), but these dependencies are spread out over many
latent variables. This implies that the dependency between any two latent variables is expected to
be small. This is precisely the right setting for a mean field (i.e. fully factorized variational)
approximation: a particular variable interacts with the remaining variables only through summary
statistics called the field, and the impact of any single variable on the field is very small [9].
Note that this is not true in the joint space of parameters and latent variables because
fluctuations in parameters can have a significant impact on latent variables. We thus conjecture
that the mean field assumptions are much better satisfied in the collapsed space of latent variables
than in the joint space of latent variables and parameters. In this paper we leverage this insight
and propose a collapsed variational Bayesian (CVB) inference algorithm.

In theory, the CVB algorithm requires the calculation of very expensive averages. However, the
averages only depend on sums of independent Bernoulli variables, and thus are very closely
approximated with Gaussian distributions (even for relatively small sums). Making use of this
approximation, the final algorithm is computationally efficient, easy to implement and
significantly more accurate than standard VB.


2  Approximate  Inference  in  Latent  Dirichlet  Allocation 


LDA models each document as a mixture over topics. We assume there are $K$ latent topics, each
being a multinomial distribution over a vocabulary of size $W$. For document $j$, we first draw a
mixing proportion $\theta_j = \{\theta_{jk}\}$ over $K$ topics from a symmetric Dirichlet with
parameter $\alpha$. For the $i$th word in the document, a topic $z_{ij}$ is drawn with topic $k$
chosen with probability $\theta_{jk}$, then word $x_{ij}$ is drawn from the $z_{ij}$th topic, with
$x_{ij}$ taking on value $w$ with probability $\phi_{kw}$. Finally, a symmetric Dirichlet prior with
parameter $\beta$ is placed on the topic parameters $\phi_k = \{\phi_{kw}\}$. The full joint
distribution over all parameters and variables is:

$$p(x, z, \theta, \phi \mid \alpha, \beta) = \prod_{j} \frac{\Gamma(K\alpha)}{\Gamma(\alpha)^K} \prod_{k=1}^{K} \theta_{jk}^{\alpha - 1 + n_{jk\cdot}} \; \prod_{k=1}^{K} \frac{\Gamma(W\beta)}{\Gamma(\beta)^W} \prod_{w=1}^{W} \phi_{kw}^{\beta - 1 + n_{\cdot kw}} \qquad (1)$$

where $n_{jkw} = \#\{i : x_{ij} = w, z_{ij} = k\}$, and dot means the corresponding index is summed
out: $n_{\cdot kw} = \sum_j n_{jkw}$ and $n_{jk\cdot} = \sum_w n_{jkw}$.

Given the observed words $x = \{x_{ij}\}$ the task of Bayesian inference is to compute the posterior
distribution over the latent topic indices $z = \{z_{ij}\}$, the mixing proportions
$\theta = \{\theta_j\}$ and the topic parameters $\phi = \{\phi_k\}$. There are three current
approaches: variational Bayes (VB) [2], expectation propagation [7] and collapsed Gibbs sampling [5].
We review the VB and collapsed Gibbs sampling methods here, as they are the most popular methods
and motivate our new algorithm, which combines advantages of both.
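The generative process and the count statistics of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`sample_dirichlet`, `generate_corpus`, `count_statistics`) are hypothetical, every document is given the same length for simplicity, and the symmetric Dirichlet is sampled by normalizing Gamma draws.

```python
import random

def sample_dirichlet(concentration, dim, rng):
    """Symmetric Dirichlet draw, obtained by normalizing Gamma(concentration, 1) draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(num_docs, doc_len, K, W, alpha, beta, seed=0):
    """Sample words x and topic indices z following the LDA generative process."""
    rng = random.Random(seed)
    phi = [sample_dirichlet(beta, W, rng) for _ in range(K)]  # one word distribution per topic
    docs, topics = [], []
    for _ in range(num_docs):
        theta_j = sample_dirichlet(alpha, K, rng)  # mixing proportions of document j
        z_j = rng.choices(range(K), weights=theta_j, k=doc_len)
        x_j = [rng.choices(range(W), weights=phi[k])[0] for k in z_j]
        docs.append(x_j)
        topics.append(z_j)
    return docs, topics

def count_statistics(docs, topics, K, W):
    """Return n_{jk.}, n_{.kw} and n_{.k.} for a given assignment z."""
    n_jk = [[0] * K for _ in docs]
    n_kw = [[0] * W for _ in range(K)]
    n_k = [0] * K
    for j, (x_j, z_j) in enumerate(zip(docs, topics)):
        for w, k in zip(x_j, z_j):
            n_jk[j][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
    return n_jk, n_kw, n_k
```

The three count tables are the only quantities through which the words and topic indices enter the collapsed expressions used later in the paper.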


2.1  Variational  Bayes 

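Standard VB inference upper bounds the negative log marginal likelihood $-\log p(x \mid \alpha, \beta)$
using the variational free energy:

$$-\log p(x \mid \alpha, \beta) \le \tilde{F}(\tilde{q}(z, \theta, \phi)) = E_{\tilde{q}}[-\log p(x, z, \theta, \phi \mid \alpha, \beta)] - H(\tilde{q}(z, \theta, \phi)) \qquad (2)$$

with $\tilde{q}(z, \theta, \phi)$ an approximate posterior,
$H(\tilde{q}(z, \theta, \phi)) = E_{\tilde{q}}[-\log \tilde{q}(z, \theta, \phi)]$ the variational
entropy, and $\tilde{q}(z, \theta, \phi)$ assumed to be fully factorized:

$$\tilde{q}(z, \theta, \phi) = \prod_{ij} \tilde{q}(z_{ij} \mid \tilde{\gamma}_{ij}) \prod_{j} \tilde{q}(\theta_j \mid \tilde{\alpha}_j) \prod_{k} \tilde{q}(\phi_k \mid \tilde{\beta}_k) \qquad (3)$$

$\tilde{q}(z_{ij} \mid \tilde{\gamma}_{ij})$ is multinomial with parameters $\tilde{\gamma}_{ij}$, and
$\tilde{q}(\theta_j \mid \tilde{\alpha}_j)$, $\tilde{q}(\phi_k \mid \tilde{\beta}_k)$ are Dirichlet
with parameters $\tilde{\alpha}_j$ and $\tilde{\beta}_k$ respectively. Optimizing
$\tilde{F}(\tilde{q})$ with respect to the variational parameters gives us a set of updates that are
guaranteed to improve $\tilde{F}(\tilde{q})$ at each iteration and to converge to a local minimum:

$$\tilde{\alpha}_{jk} = \alpha + \sum_{i} \tilde{\gamma}_{ijk} \qquad (4)$$

$$\tilde{\beta}_{kw} = \beta + \sum_{ij} \mathbb{1}(x_{ij} = w)\, \tilde{\gamma}_{ijk} \qquad (5)$$

$$\tilde{\gamma}_{ijk} \propto \exp\Big( \Psi(\tilde{\alpha}_{jk}) + \Psi(\tilde{\beta}_{k x_{ij}}) - \Psi\big(\textstyle\sum_{w} \tilde{\beta}_{kw}\big) \Big) \qquad (6)$$

where $\Psi(y) = \frac{\partial \log \Gamma(y)}{\partial y}$ is the digamma function and $\mathbb{1}$
is the indicator function.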

Although efficient and easily implemented, VB can potentially lead to very inaccurate results.
Notice that the latent variables $z$ and parameters $\theta$, $\phi$ can be strongly dependent in
the true posterior $p(z, \theta, \phi \mid x)$ through the cross terms in (1). This dependence is
ignored in VB, which assumes that latent variables and parameters are independent instead. As a
result, the VB upper bound on the negative log marginal likelihood can be very loose, leading to
inaccurate estimates of the posterior.
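As an illustration of the standard VB update for a single token, the sketch below computes the multinomial parameters of one word as the normalized exponential of digamma differences of the Dirichlet parameters. It is a minimal sketch, not the authors' implementation: `digamma` is a hand-written approximation (recurrence plus an asymptotic series), and the names `vb_token_update`, `alpha_tilde_j` and `beta_tilde` are assumptions made for this example.

```python
import math

def digamma(y):
    """Approximate digamma for y > 0: shift y upward, then use an asymptotic series."""
    result = 0.0
    while y < 6.0:
        result -= 1.0 / y  # Psi(y) = Psi(y + 1) - 1 / y
        y += 1.0
    inv2 = 1.0 / (y * y)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(y) - 0.5 / y - series

def vb_token_update(alpha_tilde_j, beta_tilde, w):
    """Topic responsibilities of one token with word index w.

    alpha_tilde_j: Dirichlet parameters of q(theta_j), length K.
    beta_tilde:    K lists of length W, Dirichlet parameters of q(phi_k).
    """
    scores = [digamma(a) + digamma(b_k[w]) - digamma(sum(b_k))
              for a, b_k in zip(alpha_tilde_j, beta_tilde)]
    top = max(scores)  # subtract the maximum for numerical stability
    unnorm = [math.exp(s - top) for s in scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

Note that the update sees the parameters only through the current Dirichlet parameters of the factorized posterior; the coupling between a token and the parameters it helped to shape is lost, which is the source of the looseness discussed above.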


2.2  Collapsed  Gibbs  Sampling 


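Standard Gibbs sampling, which iteratively samples latent variables $z$ and parameters $\theta$,
$\phi$, can potentially have slow convergence due again to strong dependencies between the
parameters and latent variables. Collapsed Gibbs sampling improves upon Gibbs sampling by
marginalizing out $\theta$ and $\phi$ instead, therefore dealing with them exactly. The marginal
distribution over $x$ and $z$ is

$$p(z, x \mid \alpha, \beta) = \prod_{j} \left[ \frac{\Gamma(K\alpha)}{\Gamma(K\alpha + n_{j\cdot\cdot})} \prod_{k} \frac{\Gamma(\alpha + n_{jk\cdot})}{\Gamma(\alpha)} \right] \prod_{k} \left[ \frac{\Gamma(W\beta)}{\Gamma(W\beta + n_{\cdot k\cdot})} \prod_{w} \frac{\Gamma(\beta + n_{\cdot kw})}{\Gamma(\beta)} \right] \qquad (7)$$

Given the current state of all but one variable $z_{ij}$, the conditional probability of $z_{ij}$ is:

$$p(z_{ij} = k \mid z^{\neg ij}, x, \alpha, \beta) = \frac{(\alpha + n_{jk\cdot}^{\neg ij})(\beta + n_{\cdot k x_{ij}}^{\neg ij})(W\beta + n_{\cdot k\cdot}^{\neg ij})^{-1}}{\sum_{k'=1}^{K} (\alpha + n_{jk'\cdot}^{\neg ij})(\beta + n_{\cdot k' x_{ij}}^{\neg ij})(W\beta + n_{\cdot k'\cdot}^{\neg ij})^{-1}} \qquad (8)$$

where the superscript $\neg ij$ means the corresponding variables or counts with $x_{ij}$ and
$z_{ij}$ excluded, and the denominator is just a normalization. The conditional distribution of
$z_{ij}$ is multinomial with simple-to-calculate probabilities, so the programming and
computational overhead is minimal.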
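One sweep of collapsed Gibbs sampling can be sketched as follows. The code is a hedged illustration: it assumes the conditional of each topic index is proportional to $(\alpha + n_{jk\cdot}^{\neg ij})(\beta + n_{\cdot k x_{ij}}^{\neg ij})/(W\beta + n_{\cdot k\cdot}^{\neg ij})$, and the names `init_counts` and `gibbs_sweep` are hypothetical.

```python
import random

def init_counts(docs, K, W, rng):
    """Assign every token a random topic and build the count tables."""
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    n_jk = [[0] * K for _ in docs]
    n_kw = [[0] * W for _ in range(K)]
    n_k = [0] * K
    for j, doc in enumerate(docs):
        for w, k in zip(doc, z[j]):
            n_jk[j][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
    return z, n_jk, n_kw, n_k

def gibbs_sweep(docs, z, n_jk, n_kw, n_k, alpha, beta, rng):
    """Resample every z_ij once from its conditional given all other topic indices."""
    K, W = len(n_k), len(n_kw[0])
    for j, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[j][i]
            # Remove token (i, j) so the counts become the "not ij" counts.
            n_jk[j][k_old] -= 1
            n_kw[k_old][w] -= 1
            n_k[k_old] -= 1
            weights = [(alpha + n_jk[j][k]) * (beta + n_kw[k][w]) / (W * beta + n_k[k])
                       for k in range(K)]
            k_new = rng.choices(range(K), weights=weights)[0]
            # Add the token back under its new topic.
            z[j][i] = k_new
            n_jk[j][k_new] += 1
            n_kw[k_new][w] += 1
            n_k[k_new] += 1
```

`random.choices` normalizes the weights itself, so the denominator of the conditional never has to be computed explicitly.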


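Collapsed Gibbs sampling has been observed to converge quickly [5]. Notice from (8) that $z_{ij}$
depends on $z^{\neg ij}$ only through the counts $n_{jk\cdot}^{\neg ij}$,
$n_{\cdot k x_{ij}}^{\neg ij}$ and $n_{\cdot k\cdot}^{\neg ij}$. In particular, the dependence of
$z_{ij}$ on any particular other variable $z_{i'j'}$ is very weak, especially for large datasets. As
a result we expect the convergence of collapsed Gibbs sampling to be fast [10]. However, as with
other MCMC samplers, and unlike variational inference, it is often hard to diagnose convergence,
and a sufficiently large number of samples may be required to reduce sampling noise.

The argument of rapid convergence of collapsed Gibbs sampling is reminiscent of the argument for
when mean field algorithms can be expected to be accurate [9]. The counts $n_{jk\cdot}^{\neg ij}$,
$n_{\cdot k x_{ij}}^{\neg ij}$ and $n_{\cdot k\cdot}^{\neg ij}$ act as fields through which $z_{ij}$
interacts with other variables. In particular, averaging both sides of (8) by
$p(z^{\neg ij} \mid x, \alpha, \beta)$ gives us the Callen equations, a set of equations that the
true posterior must satisfy:

$$p(z_{ij} = k \mid x, \alpha, \beta) = E_{p(z^{\neg ij} \mid x, \alpha, \beta)} \left[ \frac{(\alpha + n_{jk\cdot}^{\neg ij})(\beta + n_{\cdot k x_{ij}}^{\neg ij})(W\beta + n_{\cdot k\cdot}^{\neg ij})^{-1}}{\sum_{k'=1}^{K} (\alpha + n_{jk'\cdot}^{\neg ij})(\beta + n_{\cdot k' x_{ij}}^{\neg ij})(W\beta + n_{\cdot k'\cdot}^{\neg ij})^{-1}} \right] \qquad (9)$$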


Since  the  latent  variables  are  already  weakly  dependent  on  each  other,  it  is  possible  to  replace  (9) 
by  a  set  of  mean  field  equations  where  latent  variables  are  assumed  independent  and  still  expect 
these  equations  to  be  accurate.  This  is  the  idea  behind  the  collapsed  variational  Bayesian  inference 
algorithm  of  the  next  section. 


3  Collapsed  Variational  Bayesian  Inference  for  LDA 

We  derive  a  new  inference  algorithm  for  LDA  combining  the  advantages  of  both  standard  VB  and 
collapsed  Gibbs  sampling.  It  is  a  variational  algorithm  which,  instead  of  assuming  independence, 
models  the  dependence  of  the  parameters  on  the  latent  variables  in  an  exact  fashion.  On  the  other 
hand  we  still  assume  that  latent  variables  are  mutually  independent.  This  is  not  an  unreasonable 
assumption  to  make  since  as  we  saw  they  are  only  weakly  dependent  on  each  other.  We  call  this 
algorithm  collapsed  variational  Bayesian  (CVB)  inference. 

There are two ways to deal with the parameters in an exact fashion: the first is to marginalize
them out of the joint distribution and to start from (7); the second is to explicitly model the
posterior of $\theta$, $\phi$ given $z$ and $x$ without any assumptions on its form. We will show
that these two methods are equivalent. The only assumption we make in CVB is that the latent
variables $z$ are mutually independent, thus we approximate the posterior as:

$$\hat{q}(z, \theta, \phi) = \hat{q}(\theta, \phi \mid z) \prod_{ij} \hat{q}(z_{ij} \mid \hat{\gamma}_{ij}) \qquad (10)$$

where $\hat{q}(z_{ij} \mid \hat{\gamma}_{ij})$ is multinomial with parameters $\hat{\gamma}_{ij}$.
The variational free energy becomes:

$$\hat{F}(\hat{q}(z)\hat{q}(\theta, \phi \mid z)) = E_{\hat{q}(z)\hat{q}(\theta, \phi \mid z)}[-\log p(x, z, \theta, \phi \mid \alpha, \beta)] - H(\hat{q}(z)\hat{q}(\theta, \phi \mid z))$$

$$= E_{\hat{q}(z)}\big[ E_{\hat{q}(\theta, \phi \mid z)}[-\log p(x, z, \theta, \phi \mid \alpha, \beta)] - H(\hat{q}(\theta, \phi \mid z)) \big] - H(\hat{q}(z)) \qquad (11)$$

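We minimize the variational free energy with respect to $\hat{q}(\theta, \phi \mid z)$ first,
followed by $\hat{q}(z)$. Since we do not restrict the form of $\hat{q}(\theta, \phi \mid z)$, the
minimum is achieved at the true posterior
$\hat{q}(\theta, \phi \mid z) = p(\theta, \phi \mid x, z, \alpha, \beta)$, and the variational free
energy simplifies to:

$$\hat{F}(\hat{q}(z)) \triangleq \min_{\hat{q}(\theta, \phi \mid z)} \hat{F}(\hat{q}(z)\hat{q}(\theta, \phi \mid z)) = E_{\hat{q}(z)}[-\log p(x, z \mid \alpha, \beta)] - H(\hat{q}(z)) \qquad (12)$$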

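We see that CVB is equivalent to marginalizing out $\theta$, $\phi$ before approximating the
posterior over $z$. As CVB makes a strictly weaker assumption on the variational posterior than
standard VB, we have

$$\hat{F}(\hat{q}(z)) \le \tilde{F}(\tilde{q}(z)) \triangleq \min_{\tilde{q}(\theta)\tilde{q}(\phi)} \tilde{F}(\tilde{q}(z)\tilde{q}(\theta)\tilde{q}(\phi)) \qquad (13)$$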


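and thus CVB is a better approximation than standard VB. Finally, we derive the updates for the
variational parameters $\hat{\gamma}_{ij}$. Minimizing (12) with respect to $\hat{\gamma}_{ijk}$, we
get

$$\hat{\gamma}_{ijk} = \hat{q}(z_{ij} = k) = \frac{\exp\big( E_{\hat{q}(z^{\neg ij})}[\log p(x, z^{\neg ij}, z_{ij} = k \mid \alpha, \beta)] \big)}{\sum_{k'=1}^{K} \exp\big( E_{\hat{q}(z^{\neg ij})}[\log p(x, z^{\neg ij}, z_{ij} = k' \mid \alpha, \beta)] \big)} \qquad (14)$$

Plugging in (7), expanding $\log \frac{\Gamma(\eta + n)}{\Gamma(\eta)} = \sum_{l=0}^{n-1} \log(\eta + l)$
for positive reals $\eta$ and positive integers $n$, and cancelling terms appearing both in the
numerator and denominator, we get

$$\hat{\gamma}_{ijk} = \frac{\exp\big( E_{\hat{q}(z^{\neg ij})}[\log(\alpha + n_{jk\cdot}^{\neg ij}) + \log(\beta + n_{\cdot k x_{ij}}^{\neg ij}) - \log(W\beta + n_{\cdot k\cdot}^{\neg ij})] \big)}{\sum_{k'=1}^{K} \exp\big( E_{\hat{q}(z^{\neg ij})}[\log(\alpha + n_{jk'\cdot}^{\neg ij}) + \log(\beta + n_{\cdot k' x_{ij}}^{\neg ij}) - \log(W\beta + n_{\cdot k'\cdot}^{\neg ij})] \big)} \qquad (15)$$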


3.1  Gaussian  approximation  for  CVB  Inference 


For completeness, we describe how to compute each expectation term in (15) exactly in the
appendix. This exact implementation of CVB is computationally too expensive to be practical, and
we propose instead to use a simple Gaussian approximation which works very accurately and which
requires minimal computational costs.

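In this section we describe the Gaussian approximation applied to
$E_{\hat{q}}[\log(\alpha + n_{jk\cdot}^{\neg ij})]$; the other two expectation terms are similarly
computed. Assume that $n_{j\cdot\cdot} \gg 0$. Notice that
$n_{jk\cdot}^{\neg ij} = \sum_{i' \ne i} \mathbb{1}(z_{i'j} = k)$ is a sum of a large number of
independent Bernoulli variables $\mathbb{1}(z_{i'j} = k)$, each with mean parameter
$\hat{\gamma}_{i'jk}$, thus it can be accurately approximated by a Gaussian. The mean and variance
are given by the sum of the means and variances of the individual Bernoulli variables:

$$E_{\hat{q}}[n_{jk\cdot}^{\neg ij}] = \sum_{i' \ne i} \hat{\gamma}_{i'jk}, \qquad \mathrm{Var}_{\hat{q}}[n_{jk\cdot}^{\neg ij}] = \sum_{i' \ne i} \hat{\gamma}_{i'jk}(1 - \hat{\gamma}_{i'jk}) \qquad (16)$$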


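We further approximate the function $\log(\alpha + n_{jk\cdot}^{\neg ij})$ using a second-order
Taylor expansion about $E_{\hat{q}}[n_{jk\cdot}^{\neg ij}]$, and evaluate its expectation under the
Gaussian approximation:

$$E_{\hat{q}}[\log(\alpha + n_{jk\cdot}^{\neg ij})] \approx \log(\alpha + E_{\hat{q}}[n_{jk\cdot}^{\neg ij}]) - \frac{\mathrm{Var}_{\hat{q}}(n_{jk\cdot}^{\neg ij})}{2(\alpha + E_{\hat{q}}[n_{jk\cdot}^{\neg ij}])^2} \qquad (17)$$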


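Because $E_{\hat{q}}[n_{jk\cdot}^{\neg ij}] \gg 0$, the third derivative is small and the Taylor
series approximation is very accurate. In fact, we have found experimentally that the Gaussian
approximation works very well even when $n_{j\cdot\cdot}$ is small. The reason is that we often have
$\hat{\gamma}_{i'jk}$ being close to either 0 or 1, thus the variance of $n_{jk\cdot}^{\neg ij}$ is
small relative to its mean and the Gaussian approximation will be accurate. Finally, plugging (17)
into (15), we have our CVB updates:

$$\hat{\gamma}_{ijk} \propto \big(\alpha + E_{\hat{q}}[n_{jk\cdot}^{\neg ij}]\big)\big(\beta + E_{\hat{q}}[n_{\cdot k x_{ij}}^{\neg ij}]\big)\big(W\beta + E_{\hat{q}}[n_{\cdot k\cdot}^{\neg ij}]\big)^{-1}$$

$$\times \exp\left( -\frac{\mathrm{Var}_{\hat{q}}(n_{jk\cdot}^{\neg ij})}{2(\alpha + E_{\hat{q}}[n_{jk\cdot}^{\neg ij}])^2} - \frac{\mathrm{Var}_{\hat{q}}(n_{\cdot k x_{ij}}^{\neg ij})}{2(\beta + E_{\hat{q}}[n_{\cdot k x_{ij}}^{\neg ij}])^2} + \frac{\mathrm{Var}_{\hat{q}}(n_{\cdot k\cdot}^{\neg ij})}{2(W\beta + E_{\hat{q}}[n_{\cdot k\cdot}^{\neg ij}])^2} \right) \qquad (18)$$

Notice the striking correspondence between (18), (8) and (9), showing that CVB is indeed the mean
field version of collapsed Gibbs sampling. In particular, the first line in (18) is obtained from
(8) by replacing the fields $n_{jk\cdot}^{\neg ij}$, $n_{\cdot k x_{ij}}^{\neg ij}$ and
$n_{\cdot k\cdot}^{\neg ij}$ by their means (thus the term mean field), while the exponentiated terms
are correction factors accounting for the variance in the fields.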
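A sweep of the CVB update with the Gaussian approximation can be sketched as below, obtained by substituting the second-order approximation (17) for each of the three expectations in (15). This is a minimal sketch under assumptions, not the authors' code: it stores one posterior per token rather than per unique document/word pair, and the names `init_cvb`, `add_token` and `cvb_sweep` are hypothetical.

```python
import math
import random

def add_token(mean, var, j, w, g, sign):
    """Add (sign=+1) or remove (sign=-1) one token's Bernoulli means and variances."""
    for k, p in enumerate(g):
        for key, a, b in (("jk", j, k), ("kw", k, w)):
            mean[key][a][b] += sign * p
            var[key][a][b] += sign * p * (1.0 - p)
        mean["k"][k] += sign * p
        var["k"][k] += sign * p * (1.0 - p)

def init_cvb(docs, K, W, rng):
    """Random initial posteriors gamma[j][i] and the field means and variances."""
    gamma = []
    for doc in docs:
        rows = []
        for _ in doc:
            raw = [rng.random() + 1e-3 for _ in range(K)]
            total = sum(raw)
            rows.append([r / total for r in raw])
        gamma.append(rows)
    def zeros():
        return {"jk": [[0.0] * K for _ in docs],
                "kw": [[0.0] * W for _ in range(K)],
                "k": [0.0] * K}
    mean, var = zeros(), zeros()
    for j, doc in enumerate(docs):
        for i, w in enumerate(doc):
            add_token(mean, var, j, w, gamma[j][i], 1.0)
    return gamma, mean, var

def cvb_sweep(docs, gamma, mean, var, alpha, beta):
    """Update every gamma[j][i] once: mean field term times variance correction."""
    K, W = len(mean["k"]), len(mean["kw"][0])
    for j, doc in enumerate(docs):
        for i, w in enumerate(doc):
            add_token(mean, var, j, w, gamma[j][i], -1.0)  # fields without token ij
            log_g = []
            for k in range(K):
                a = alpha + mean["jk"][j][k]
                b = beta + mean["kw"][k][w]
                c = W * beta + mean["k"][k]
                log_g.append(math.log(a) + math.log(b) - math.log(c)
                             - var["jk"][j][k] / (2 * a * a)
                             - var["kw"][k][w] / (2 * b * b)
                             + var["k"][k] / (2 * c * c))
            top = max(log_g)
            unnorm = [math.exp(v - top) for v in log_g]
            total = sum(unnorm)
            gamma[j][i] = [u / total for u in unnorm]
            add_token(mean, var, j, w, gamma[j][i], 1.0)
```

The structure mirrors a collapsed Gibbs sweep: remove the token from the fields, compute $K$ scores, add the token back. The difference is that fractional means and variances are moved instead of integer counts, and no random numbers are drawn after initialization.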


CVB with the Gaussian approximation is easily implemented and has minimal computational costs.
By keeping track of the mean and variance of $n_{jk\cdot}$, $n_{\cdot kw}$ and $n_{\cdot k\cdot}$,
and subtracting the mean and variance of the corresponding Bernoulli variables whenever we require
the terms with $x_{ij}$, $z_{ij}$ removed, the computational cost scales only as $O(K)$ for each
update to $\hat{q}(z_{ij})$. Further, we only need to maintain one copy of the variational posterior
over the latent variable for each unique document/word pair, thus the overall computational cost
per iteration of CVB scales as $O(MK)$ where $M$ is the total number of unique document/word pairs,
while the memory requirement is $O(MK)$. This is the same as for VB. In comparison, collapsed Gibbs
sampling needs to keep track of the current sample of $z_{ij}$ for every word in the corpus, thus
the memory requirement is $O(N)$ while the computational cost scales as $O(NK)$ where $N$ is the
total number of words in the corpus, which is higher than for VB and CVB. Note however that the
constant factor involved in the $O(NK)$ time cost of collapsed Gibbs sampling is significantly
smaller than those for VB and CVB.


4  Experiments 


We compared the three algorithms described in the paper: standard VB, CVB and collapsed Gibbs
sampling. We used two datasets. The first is “KOS” (www.dailykos.com), which has $J = 3430$
documents, a vocabulary size of $W = 6909$, a total of $N = 467{,}714$ words in all the documents
and on average 136 words per document. The second is “NIPS” (books.nips.cc) with $J = 1675$
documents, a vocabulary size of $W = 12419$, $N = 2{,}166{,}029$ words in the corpus and on average
1293 words per document. In both datasets stop words and infrequent words were removed. We split
both datasets into a training set and a test set by assigning 10% of the words in each document to
the test set. In all our experiments we used $\alpha = 0.1$, $\beta = 0.1$, $K = 8$ topics for KOS
and $K = 40$ for NIPS. We ran each algorithm on each dataset 50 times with different random
initializations.


Performance was measured in two ways: first using variational bounds of the log marginal
probabilities on the training set, and secondly using log probabilities on the test set.
Expressions for the variational bounds are given in (2) for VB and (12) for CVB. For both VB and
CVB, test set log probabilities are computed as:

$$p(x^{\mathrm{test}}) = \prod_{ij} \sum_{k} \bar{\theta}_{jk}\, \bar{\phi}_{k x_{ij}^{\mathrm{test}}}, \qquad \bar{\theta}_{jk} = \frac{\alpha + E_q[n_{jk\cdot}]}{K\alpha + E_q[n_{j\cdot\cdot}]}, \qquad \bar{\phi}_{kw} = \frac{\beta + E_q[n_{\cdot kw}]}{W\beta + E_q[n_{\cdot k\cdot}]} \qquad (19)$$

Note that we used estimated mean values of $\theta_{jk}$ and $\phi_{kw}$ [11]. For collapsed Gibbs
sampling, given $S$ samples from the posterior, we used:
$$p(x^{\mathrm{test}}) = \prod_{ij} \sum_{k} \frac{1}{S} \sum_{s=1}^{S} \theta_{jk}^{s}\, \phi_{k x_{ij}^{\mathrm{test}}}^{s}, \qquad \theta_{jk}^{s} = \frac{\alpha + n_{jk\cdot}^{s}}{K\alpha + n_{j\cdot\cdot}^{s}}, \qquad \phi_{kw}^{s} = \frac{\beta + n_{\cdot kw}^{s}}{W\beta + n_{\cdot k\cdot}^{s}} \qquad (20)$$

Figure 1 summarizes our results. We show both quantities as functions of iterations and as
histograms of final values for all algorithms and datasets. CVB converged faster and to
significantly better solutions than standard VB; this confirms our intuition that CVB provides much
better approximations than VB. CVB also converged faster than collapsed Gibbs sampling, but Gibbs
sampling attains a better solution in the end; this is reasonable since Gibbs sampling should be
exact with enough samples. We have also applied the exact but much slower version of CVB without
the Gaussian approximation, and found that it gave identical results to the one proposed here (not
shown).

Figure 1: Left: results for KOS. Right: results for NIPS. First row: per word variational bounds as
functions of numbers of iterations of VB and CVB. Second row: histograms of converged per word
variational bounds across random initializations for VB and CVB. Third row: test set per word log
probabilities as functions of numbers of iterations for VB, CVB and Gibbs. Fourth row: histograms of
final test set per word log probabilities across 50 random initializations.

Figure 2: Left: test set per word log probabilities. Right: per word variational bounds. Both as
functions of the number of documents for KOS.

We have also studied the dependence of approximation accuracies on the number of documents in
the corpus. To conduct this experiment we train on 90% of the words in a (growing) subset of the
corpus and test on the corresponding 10% left-out words. In Figure 2 we show both variational
bounds and test set log probabilities as functions of the number of documents $J$. We observe that,
as expected, the variational methods improve as $J$ increases. However, perhaps surprisingly, CVB
does not suffer as much as VB for small values of $J$, even though one might expect that the
Gaussian approximation becomes dubious in that regime.
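The per word test set log probability used for VB and CVB can be sketched as follows. The sketch assumes the usual posterior-mean estimates, $\bar{\theta}_{jk} = (\alpha + E[n_{jk\cdot}])/(K\alpha + E[n_{j\cdot\cdot}])$ and $\bar{\phi}_{kw} = (\beta + E[n_{\cdot kw}])/(W\beta + E[n_{\cdot k\cdot}])$, computed from the expected training counts; the function name and argument layout are hypothetical.

```python
import math

def per_word_test_log_prob(test_docs, mean_jk, mean_kw, mean_k, alpha, beta):
    """Average log probability of held-out words under mean estimates of theta and phi.

    mean_jk[j][k], mean_kw[k][w], mean_k[k] are expected training counts.
    test_docs[j] lists the held-out word indices of document j.
    """
    K, W = len(mean_k), len(mean_kw[0])
    total, count = 0.0, 0
    for j, doc in enumerate(test_docs):
        n_j = sum(mean_jk[j])
        theta = [(alpha + mean_jk[j][k]) / (K * alpha + n_j) for k in range(K)]
        for w in doc:
            p = sum(theta[k] * (beta + mean_kw[k][w]) / (W * beta + mean_k[k])
                    for k in range(K))
            total += math.log(p)
            count += 1
    return total / count
```

With all expected counts equal to zero the estimates reduce to the uniform prior means, so every word has probability $1/W$; this gives a simple sanity check for an implementation.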

5  Discussion 

We have described a collapsed variational Bayesian (CVB) inference algorithm for LDA. The
algorithm is easy to implement, computationally efficient and more accurate than standard VB. The
central insight of CVB is that instead of assuming parameters to be independent from latent
variables, we treat their dependence on the topic variables in an exact fashion. Because the
factorization assumptions made by CVB are weaker than those made by VB, the resulting approximation
is more accurate. Computational efficiency is achieved in CVB with a Gaussian approximation, which
was found to be so accurate that there is never a need for exact summation.

The idea of integrating out parameters before applying variational inference has been
independently proposed by [12]. Unfortunately, because they worked in the context of general
conjugate-exponential families, the approach cannot be made generally computationally useful.
Nevertheless, we believe the insights of CVB can be applied to a wider class of discrete graphical
models beyond LDA. Specific examples include various extensions of LDA [4, 13], hidden Markov
models with discrete outputs, and mixed-membership models with Dirichlet distributed mixture
coefficients [14]. These models all have the property that they consist of discrete random variables
with Dirichlet priors on the parameters, which is the property allowing us to use the Gaussian
approximation. We are also exploring CVB on an even more general class of models, including
mixtures of Gaussians, Dirichlet processes, and hierarchical Dirichlet processes.

Over  the  years  a  variety  of  inference  algorithms  have  been  proposed  based  on  a  combination  of 
{maximize,  sample,  assume  independent,  marginalize  out}  applied  to  both  parameters  and  latent 
variables.  We  conclude  by  summarizing  these  algorithms  in  Table  1,  and  note  that  CVB  is  located 
in  the  marginalize  out  parameters  and  assume  latent  variables  are  independent  cell. 


                                              Parameters
Latent variables      maximize          sample            assume independent   marginalize out
maximize              Viterbi EM        ?                 ME                   ME
sample                stochastic EM     Gibbs sampling    ?                    collapsed Gibbs
assume independent    variational EM    ?                 VB                   CVB
marginalize out       EM                any MCMC          EP for LDA           intractable

Table 1: A variety of inference algorithms for graphical models. Note that not every cell is filled
in (marked by ?) while some are simply intractable. “ME” is the maximization-expectation algorithm
of [15] and “any MCMC” means that we can use any MCMC sampler for the parameters once latent
variables have been marginalized out.


A  Exact  Computation  of  Expectation  Terms  in  (15)

We can compute the expectation terms in (15) exactly as follows. Consider
$E_{\hat{q}}[\log(\alpha + n_{jk\cdot}^{\neg ij})]$, which requires computing
$\hat{q}(n_{jk\cdot}^{\neg ij})$ (other expectation terms are similarly computed). Note that
$n_{jk\cdot}^{\neg ij} = \sum_{i' \ne i} \mathbb{1}(z_{i'j} = k)$ is a sum of independent Bernoulli
variables $\mathbb{1}(z_{i'j} = k)$, each with mean parameter $\hat{\gamma}_{i'jk}$. Define vectors
$v_{i'jk} = [(1 - \hat{\gamma}_{i'jk}), \hat{\gamma}_{i'jk}]^{\top}$, and let
$v_{jk} = v_{1jk} \otimes \cdots \otimes v_{n_{j\cdot\cdot} jk}$ be the convolution of all
$v_{i'jk}$. Finally let $v_{jk}^{\neg ij}$ be $v_{jk}$ deconvolved by $v_{ijk}$. Then
$\hat{q}(n_{jk\cdot}^{\neg ij} = m)$ will be the $(m+1)$st entry in $v_{jk}^{\neg ij}$. The
expectation $E_{\hat{q}}[\log(\alpha + n_{jk\cdot}^{\neg ij})]$ can now be computed explicitly.
This exact implementation requires an impractical $O(n_{j\cdot\cdot})$ time to compute
$E_{\hat{q}}[\log(\alpha + n_{jk\cdot}^{\neg ij})]$. At the expense of complicating the algorithm
implementation, this can be improved by sparsifying the vectors $v_{jk}$ (setting small entries to
zero) as well as other computational tricks. We propose instead the Gaussian approximation of
Section 3.1, which we have found to give extremely accurate results but with minimal implementation
complexity and computational cost.
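The exact expectation and its Gaussian approximation can be compared numerically with a short sketch. It makes one simplification that should be read as an assumption: instead of deconvolving $v_{ijk}$ from the full convolution, the token $i$ is simply left out of the list `probs`, and the distribution of the count is built by repeated convolution with the two-element vectors $[1 - \hat{\gamma}, \hat{\gamma}]$. The function names are hypothetical.

```python
import math

def count_distribution(probs):
    """Distribution of a sum of independent Bernoulli(p) variables.

    Entry m of the result is the probability that the sum equals m; it is built
    by convolving the vectors [1 - p, p] one at a time.
    """
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for m, q in enumerate(dist):
            nxt[m] += q * (1.0 - p)
            nxt[m + 1] += q * p
        dist = nxt
    return dist

def exact_expected_log(alpha, probs):
    """E[log(alpha + n)] by explicit summation over the distribution of n."""
    return sum(q * math.log(alpha + m)
               for m, q in enumerate(count_distribution(probs)))

def gaussian_expected_log(alpha, probs):
    """Second-order approximation: log(alpha + mean) - var / (2 (alpha + mean)^2)."""
    mean = sum(probs)
    var = sum(p * (1.0 - p) for p in probs)
    return math.log(alpha + mean) - var / (2.0 * (alpha + mean) ** 2)
```

Calling both functions on the same `probs`, for example a mix of values near 0 and near 1, shows how close the two agree when the variance of the count is small relative to its mean.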